Interpretable Multi-Head Attention
ls-type:: annotation
hl-page:: 9
hl-color:: yellow
Modifications relative to [[Multi-Head Attention]]
- Parameter perspective #card
For V, parameters are shared across all heads: share values in each head
ls-type:: annotation
hl-page:: 9
hl-color:: yellow
, while Q and K keep head-specific parameters
- If each head used different values, attention weights alone could not indicate a particular feature's importance: Given that different values are used in each head, attention weights alone would not be indicative of a particular feature’s importance.
ls-type:: annotation
hl-page:: 9
hl-color:: yellow
- How the attention scores are used #card
Weight the shared V by each head's attention scores and average across heads: employ additive aggregation of all heads
ls-type:: annotation
hl-page:: 9
hl-color:: yellow
, whereas the original method concatenates the heads
#card Formula for InterpretableMultiHead $(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V})=\tilde{\boldsymbol{H}} \boldsymbol{W}_H$
- $\begin{aligned} \tilde{\boldsymbol{H}} &= \tilde{A}(\boldsymbol{Q}, \boldsymbol{K})\, \boldsymbol{V} \boldsymbol{W}_V \\ &= \left\{ 1/H \sum_{h=1}^{m_H} A\left(\boldsymbol{Q} \boldsymbol{W}_Q^{(h)}, \boldsymbol{K} \boldsymbol{W}_K^{(h)}\right) \right\} \boldsymbol{V} \boldsymbol{W}_V \\ &= 1/H \sum_{h=1}^{m_H} \operatorname{Attention}\left(\boldsymbol{Q} \boldsymbol{W}_Q^{(h)}, \boldsymbol{K} \boldsymbol{W}_K^{(h)}, \boldsymbol{V} \boldsymbol{W}_V\right) \end{aligned}$
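The formula above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: function and variable names are my own, and standard scaled dot-product attention is assumed for each head.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def interpretable_multi_head(Q, K, V, W_Q, W_K, W_V, W_H):
    """Interpretable multi-head attention: per-head projections for Q and K,
    a single shared value projection W_V, and head-averaged attention weights
    (additive aggregation instead of concatenation)."""
    d_attn = W_K[0].shape[1]                 # key/query dimension per head
    shared_values = V @ W_V                  # V W_V, shared by all heads
    # A~(Q, K): average the per-head attention matrices
    A_tilde = np.mean(
        [softmax((Q @ W_Q[h]) @ (K @ W_K[h]).T / np.sqrt(d_attn))
         for h in range(len(W_Q))],
        axis=0)
    H_tilde = A_tilde @ shared_values        # H~ = A~(Q, K) V W_V
    return H_tilde @ W_H, A_tilde            # output and interpretable weights
```

Because every head attends over the same projected values, the averaged matrix `A_tilde` is a single row-stochastic attention map that can be inspected directly for feature/time-step importance.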
https://blog.xiang578.com/post/logseq/Interpretable Multi-Head Attention.html